Abstract

Cloud computing is appealing from management and 
efficiency perspectives, but brings risks both known and 
unknown. Well-known and hotly-debated information 
security risks, due to software vulnerabilities, insider 
attacks, and side-channels for example, may be only 
the "tip of the iceberg." As diverse, independently de- 
veloped cloud services share ever more fluidly and ag- 
gressively multiplexed hardware resource pools, unpre- 
dictable interactions between load-balancing and other 
reactive mechanisms could lead to dynamic instabili- 
ties or "meltdowns." Non-transparent layering struc- 
tures, where alternative cloud services may appear in- 
dependent but share deep, hidden resource dependen- 
cies, may create unexpected and potentially catastrophic 
failure correlations, reminiscent of financial industry 
crashes. Finally, cloud computing exacerbates already- 
difficult digital preservation challenges, because only 
the provider of a cloud-based application or service can 
archive a "live," functional copy of a cloud artifact and 
its data for long-term cultural preservation. This pa- 
per explores these largely unrecognized risks, making 
the case that we should study them before our socioeco- 
nomic fabric becomes inextricably dependent on a con- 
venient but potentially unstable computing model. 

1 Introduction 

Attractive features and industry momentum make cloud 
computing appear destined to be the next dominant com- 
puting paradigm. Cloud computing is appealing due to 
the convenience of central management and the elastic- 
ity of resource provisioning. Moving critical informa- 
tion infrastructure to the cloud also presents risks, how- 
ever, some of which are well-known and already hot 
research topics. The much-discussed challenge of en-
suring the privacy of information hosted in the cloud,
for example [5], has resulted in an emerging breed of
"cloud-hardened" virtualization hardware [8] and secu-
rity kernels [23]. Similarly, the challenge of ensuring
high availability in the cloud has in part fueled recent
research on robust data center networking [14, 20].

This paper assumes that a large fraction of the com- 
puting industry is, for better or worse, "moving to the 
cloud," and that current research addressing the immedi-
ate information security risks is well underway and will
(eventually) succeed. Setting aside these known chal- 
lenges, therefore, this paper attempts to identify and fo- 
cus on several less well-understood — and perhaps less 
"imminent" — risks that may emerge from the shift to 
cloud computing. In particular, this paper addresses: (1) 
stability risks due to unpredictable interactions between 
independently developed but interacting cloud computa- 
tions; (2) availability risks due to non-transparent lay- 
ering resulting in hidden failure correlations; and (3) 
preservation risks due to the unavailability of a cloud 
service's essential code and data outside of the provider. 
This paper is speculative and forward-looking; the au- 
thor cannot yet offer definitive evidence that any of these 
risks will fully materialize or become vitally important, 
but rather can offer only informal arguments and anec- 
dotal evidence that these risks might become important 
issues. The above list is also probably incomplete: it is 
likely that other important risks will emerge only as the 
industry continues its shift to the cloud. Nevertheless,
I argue that it is worth proactively investigating longer-
term risks such as these before they are certain or im-
minent, as the stakes may be high. Further, once any
of these risks do become important, it may be too late 
to reconsider or slow the movement of critical infras- 
tructure to the cloud, or to rethink the architecture of 
important cloud infrastructure or services once they are 
already perceived as "mature" in the industry. 

Section 2 addresses stability risks, Section 3 explores
availability risks, and Section 4 explores preservation
risks. Section 5 briefly points out a few possible re-
search directions in which solutions might be found,
though this paper cannot and does not pretend to offer
"answers." Finally, Section 6 concludes.

2 Stability Risks from Interacting Services 

Cloud services and applications increasingly build atop 
one another in ever more complex ways, such as cloud- 
based advertising or mapping services used as compo- 
nents in other, higher-level cloud-based applications, all 
of these building on computation and storage infrastruc- 
ture offered by still other providers. Each of these in- 
teracting, codependent services and infrastructure com- 
ponents is often implemented, deployed, and maintained 
independently by a single company that, for reasons of 
competition, shares as few details as possible about the 
internal operation of its services. The resource provi-
sioning and moment-by-moment operation of each ser-
vice is often managed by dynamic, reactive control pro-
cesses that constantly monitor the behavior of customer
load, internal infrastructure, and other component ser-
vices, and implement complex proprietary policies to
optimize the provider's cost-benefit ratio.

Figure 1: Example instability risk from unintended cou-
pling of independently developed reactive controllers

Each cloud service's control loop may change the 
service's externally visible behavior, in policy-specific 
ways, based on its neighboring services' behavior, cre- 
ating cyclic control dependencies between interacting 
cloud services. These dependency cycles may lead to 
unexpected feedback and instability, in much the way 
that policy-based routing in BGP is already known to 
lead to instability or "route flapping" in the much more 
restricted "control domain" of Internet routing [6, 18].

To illustrate this risk, we consider a minimalistic, per-
haps contrived, but hopefully suggestive example in Fig-
ure 1. Application provider A develops and deploys a
cloud-based application, which runs on virtual compute 
and storage nodes from infrastructure provider B. For 
simplicity, assume A leases two virtual nodes from B, 
and dynamically load-balances incoming requests across 
the web/application servers running on these nodes. As- 
sume A's load balancer operates in a control loop with 
a 1-minute period: after each minute it evaluates each
server's current load based on that server's response time 
statistics during the past minute, and shifts more traffic 
during the next minute to the less-loaded server. Assume 
that A's load shifting algorithm is well-designed and sta- 
ble assuming the servers in the pool behave consistently 
over time, like dedicated physical servers would. 

Unbeknownst to A, however, suppose B also runs a 
control loop, which attempts to optimize the power con- 
sumption of its physical servers by dynamically adjust- 
ing the servers' clock rates based on load. This control 
loop also happens to have a 1-minute period: after each
minute, B's controller measures each CPU core's uti-
lization during the past minute, then reduces the core's
voltage and speed if the core was underutilized or in-
creases voltage and speed if the core was overutilized.
Again, assume that B's controller is well-designed and 
stable assuming that the servers' load stays relatively 
constant or varies independently of B's control actions. 

Although both A's and B's control loops would be 
stable if operating alone, by the misfortune of their en- 
gineers (independently) picking similar control loop pe- 
riods, the combination of the two control loops may risk 
a positive feedback loop. Suppose during one minute 
the load is slightly imbalanced toward virtual server 1, 
and the two control loops' periods happen to be closely 
aligned; this will happen sooner or later in the likely 
event their clocks run at slightly different rates. A's load 
balancer notices this and shifts some load away from the 
node in the next minute, while B's power optimizer no- 
tices the same thing and increases the node's voltage and 
clock speed. While either of these actions alone would 
lead toward convergence, the two in combination cause 
overcompensation: during the next minute, server 1 be- 
comes more underutilized than it was overutilized in the 
previous minute. The two controllers each compensate 
with a stronger action — a larger shift of traffic back to 
server 1 by A and a larger decrease in voltage and clock 
speed by B — causing a larger swing the next minute. 
Soon all incoming load is oscillating between the two 
servers, cutting the system's overall capacity in half — or 
worse, if more than two servers are involved. 
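
As a rough illustration of the overcompensation described above, the following Python sketch tracks server 1's utilization excess under the two controllers. It is a deliberately simplified linear toy model, not a description of any real load balancer or power manager: the function name simulate, the gain values, and the assumption that each controller's correction subtracts directly from the observed excess are all hypothetical.

```python
def simulate(gain_a, gain_b, minutes=8, initial_excess=0.05):
    """Return server 1's utilization excess observed at the start of each minute.

    Hypothetical toy model: excess = (extra traffic sent to server 1)
    minus (extra capacity granted to server 1). Both controllers read the
    same excess from the past minute and act at the same instant.
    """
    traffic_offset = initial_excess  # state owned by A's load balancer
    capacity_offset = 0.0            # state owned by B's power optimizer
    history = []
    for _ in range(minutes):
        excess = traffic_offset - capacity_offset
        history.append(excess)
        traffic_offset -= gain_a * excess   # A shifts traffic away from the busier server
        capacity_offset += gain_b * excess  # B raises the busier server's clock speed
    return history


# Assumed gains: each controller alone slightly over-corrects but converges.
a_alone = simulate(gain_a=1.1, gain_b=0.0)
b_alone = simulate(gain_a=0.0, gain_b=1.1)
combined = simulate(gain_a=1.1, gain_b=1.1)

for name, run in [("A alone", a_alone), ("B alone", b_alone), ("A and B", combined)]:
    print(name, [round(e, 4) for e in run])
```

In this model each minute multiplies the excess by (1 - gain_a - gain_b). With the assumed gains, either controller alone multiplies it by -0.1, a quickly damped overshoot, while the pair multiplies it by -1.2: the sign flips and the swing grows every minute, as in the scenario above.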

This simplistic example might be unlikely to occur in 
exactly this form on real systems — or might be quickly 
detected and "fixed" during development and testing — 
but it suggests a general risk. When multiple cloud ser- 
vices independently attempt to optimize their own oper- 
ation using control loops that both monitor, and affect, 
the behavior of upstream, downstream, or neighboring 
cloud services, it is hard to predict the outcome: we 
might well risk deploying a combination of control loops 
that behaves well "almost all of the time," until the emer- 
gence of the rare, but fatal, cloud computing equivalent 
of the Tacoma Narrows Bridge [3, 10].

Comparable forms of "emergent misbehavior" have 
been observed in real computing systems outside of the 
cloud context [11], and some work has studied the chal-
lenge of coordinating and stabilizing multiple interact-
ing control loops, such as in power management [13].
Current approaches to solving or heading off such insta- 
bility risks, however, generally assume that some single 
engineer or company has complete information about, 
and control over, all the interacting layers and their con- 
trol loops. The cloud business model undermines this 
design assumption, by incentivizing providers not to 
share with each other the details of their resource alloca- 
tion and optimization algorithms — crucial parts of their 
"secret sauce" — that would be necessary to analyze or 
ensure the stability of the larger, composite system. 

Figure 2: Cloud service stack illustrating risks of corre-
lated failures due to hidden service interdependencies

3 Risks of Hidden Failure Correlations 

Ensuring high availability is usually a high priority for 
cloud infrastructure and services, and state replication 
and fault tolerance mechanisms are the focus of much
industry and research attention. Most of this attention
is focused within a particular cloud service, however.
In addition to the stability risks discussed above, inter- 
actions between multiple interdependent cloud services 
could lead to availability risks not yet addressed in main- 
stream research, where hardware infrastructure interde- 
pendencies hidden by proprietary business relationships 
can lead to unexpected failure correlations. 

As another contrived but illustrative example, con-
sider the "cloud service stack" in Figure 2. The provider
at the top offers a cloud-based application intended to 
offer mission-critical reliability. To ensure this reliabil- 
ity, the application replicates all critical application state 
across the storage services provided by two nominally- 
independent cloud storage providers, A and B, each of 
which in turn provides storage at multiple geographic 
sites with separate network connectivity at each site. 

Unbeknownst to the application provider, however, 
each storage provider obtains its network connections 
from a common underlying network provider, C. The 
application's access to its critical storage proves highly 
reliable as long as provider C operates normally. If 
provider C encounters a rare disaster or administrative 
glitch, however, or enters a dispute with another top-
tier network provider [4], the mission-critical applica-
tion may suddenly lose connectivity to both of its crit-
ical storage repositories. This correlated failure results
from the shared dependencies on C being hidden by the 
proprietary business relationships through which the ap- 
plication provider obtains services from A and B. 

As the cloud computing industry matures and pro- 
duces ever more complex cloud-based services, it seems 
inevitable that the depth and complexity of inter-service 
relationships will continue to explode, which may create 
unpredictable availability risks due to ever more subtle 
cross-layer interdependencies, of which the above ex- 
ample is merely the most simplistic representative. Fur- 
thermore, one of the fundamental attractions of cloud 
computing is that it makes computing infrastructure, ser-
vices, and applications into generic, almost arbitrarily
"fungible" resources that can be bought, sold, and resold 
as demanded by business objectives ll2p . 

It does not seem far-fetched to predict that cloud ser- 
vices will arise that represent a thin veneer over, or 
"repackaging" of, other services or combinations of ser- 
vices: e.g., businesses that resell, trade, or speculate on 
complex cocktails or "derivatives" of more basic cloud 
resources and services, much like the modern financial 
and energy trading industries operate. If this prediction 
bears out, the cloud services industry could similarly 
start yielding speculative bubbles and occasional large- 
scale failures, due to "overly leveraged" composite cloud 
services whose complex interdependencies hide corre- 
lated failure modes that do not become apparent until 
the bubble bursts catastrophically — perhaps not wholly 
unlike the causes of the recent financial meltdown or the 
earlier Enron energy bubble [7]. Once again, while this
risk is pure speculation at this point, it seems worth tak- 
ing seriously and exploring in advance. 

4 Digital Preservation Risks 

The final risk considered here is more long-term. With 
the tremendous economic momentum toward cloud- 
based and cloud-dependent applications and services, 
it appears inevitable that these cloud-based "digital ar- 
tifacts" will soon represent a considerable and ever- 
increasing component of our social and cultural heritage. 
In 100 years, however, will today's culturally important 
cloud-based digital artifacts still be available in a histor- 
ically accurate form — or in any form? 

A physical book has an inherent decentralized archiv- 
ability property. In order to make money on a book, its 
author or publisher must make complete copies available 
to customers. Customers in turn are free to — and can- 
not effectively be prevented from — independently stor- 
ing books for any amount of time, relocating copies to a 
safe long-term repository (e.g., a library), copying them 
to other media as the original media deteriorates, etc. 

Preservation of digital works presents many known 
challenges — principally the faster deterioration or ob- 
solescence of electronic media, and the obsolescence 
of computing environments needed to interpret old data 
formats [1, 9, 15]. Yet despite these known challenges,
traditional software and associated documents stored on
a floppy or hard disk, USB stick, or even a "cloud drive"
holding raw files, still have the same decentralized archiv-
ability property as a book. The vendor of a traditional
software application or digital document must, in or- 
der to make money, make essentially complete copies 
available to customers, and these customers can work in 
an arbitrarily decentralized fashion using their own re- 
sources to preserve digital works deemed worth saving. 

Cloud-based applications and services, however, 
completely eliminate this property of decentralized
archivability. Unlike users of Microsoft Office, users 
of Google Search or Maps never gain access to any- 
thing remotely resembling a "complete copy" of the en- 
tire digital artifact represented by the Google Search or 
Maps service. At most, users might save the results 
of particular queries or interactions. Unlike players of 
Doom, players of World of Warcraft (WoW) cannot in- 
dependently archive and preserve a copy of the WoW 
universe — or even a small portion of interest — because 
the provider of the cloud-based application need not, and 
typically does not, make publicly available the server- 
side software and data comprising the service. 

Given the number of scholarly papers written on the 
technological and social implications of each, it would 
be hard to argue that Google Search and WoW do not 
represent historically significant digital artifacts. Yet
given the rate that Google and Blizzard evolve their 
services to compete more effectively in the search and 
gaming markets, respectively, it is almost certain that 
ten years from now, no one outside these companies — 
perhaps not even anyone inside them — will be able to 
reproduce a faithful, functioning copy of the Google 
Search or WoW service as it exists today. In 100 years, 
these services will probably have evolved beyond recog- 
nition, assuming they survive at all. 

If today's digital archivists do their jobs well, in 100 
years we will be able to run today's Microsoft Word or 
play Doom (in an emulator if necessary) — but nothing 
today's digital archivists can do will preserve histori- 
cally relevant snapshots of today's cloud-based services, 
because the archivists never even get access to a "com- 
plete" snapshot for preservation. 

The historical record of today's Google Search or 
WoW will consist merely of second-hand accounts: ar- 
ticles written about them, saved search queries or screen 
shots, captured videos of particular WoW games, etc. 
While better than nothing, such second-hand accounts 
would not suffice for future historians to answer ques- 
tions such as: "How did the focus or breadth of search 
results for interesting queries evolve over the last 10 or 
100 years?" Or, "How did social-interaction and player- 
reward mechanisms change in MMOGs historically?" 

These particular examples may or may not seem inter-
esting or important, but the point is that we don't know
what future historians or social scientists will deem im- 
portant about today's world. As more of today's cul- 
ture shifts to the cloud, our failure to preserve our cloud- 
based digital artifacts could produce a "digital dark age" 
far more opaque and impenetrable to future generations 
than what media or OS obsolescence alone will produce. 

5 In Search of Possible Solutions 

This paper cannot hope to — and makes no attempt to — 
offer solutions or answers to the problems outlined
above. Instead, we merely conjecture at a few potential 
directions in which solutions might be found. 

Stabilizing Cloud Services: One place we might be- 
gin to study stability issues between interacting cloud 
services, and potential solutions, is the extensive body of 
work on the unexpected inter-AS (Autonomous System)
interactions frequently observed in BGP routing [6, 18].
In particular, the "dependency wheel" model, useful for 
reasoning about BGP policy loops, seems likely to gen- 
eralize to higher-level control loops in the cloud, such 
as load balancing policies. Most of the potential solu- 
tions explored so far in the BGP space, however, appear 
largely specific to BGP — or at least to routing — and may 
have to be rethought "from scratch" in the context of
more general, higher-level cloud services. 

Beyond BGP, classic control theory may offer a 
broader source of inspiration for methods of understand- 
ing and ensuring cloud stability. Most conventional 
control-theoretic techniques, however, are unfortunately 
constructed from the assumption that some "master sys- 
tem architect" can control or at least describe all the 
potentially-interacting control loops in a system to be 
engineered. The cloud computing model violates this 
assumption at the outset by juxtaposing many interde- 
pendent, reactive control mechanisms that are by nature 
independently developed, and are often the proprietary 
and closely-guarded business secrets of each provider. 

Deep Resource (In)Dependence Analysis: The avail-
ability risks discussed in Section 3 result from the fact
that cloud service and infrastructure providers usually 
do not reveal the deep dependency structure underly- 
ing their services. The key to this risk is the non- 
transparency of the dependency graph: the application 
provider in Figure 2 does not know that both A and B de-
pend on the same network provider C, resulting in hid-
den failure correlations. Supposing the providers were
to make these dependencies visible in an explicit depen- 
dency graph, however, we might be able to estimate ac- 
tual dependence or independence between different ser- 
vices or resources for reliability analysis. 

Hardware design techniques such as fault tree analy-
sis [2, 19] may offer some tools that could be adapted
to the purpose of reasoning about cloud service and in- 
frastructure dependencies. Consider for example a sim- 
plistic AND/OR resource dependency graph, shown in 
Figure 3. AND nodes reflect design composition and
hence conjunctive dependency: all components under- 
neath an AND node must function correctly in order for 
the component above to operate. OR nodes reflect de- 
sign redundancy and hence disjunctive dependency: if 
any component underneath the OR node operates, the 
dependent component above the OR will operate. Given 
such a graph, annotated with expected failure rates, one
might compute or estimate a system's effective reliabil-
ity after accounting for unanticipated common depen-
dencies, such as Network Provider C in the example.

Figure 3: AND/OR Graph Representing Service Composition and Infrastructure Dependencies
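
To make the idea of estimating reliability from an AND/OR graph concrete, the sketch below evaluates a small hypothetical graph modeled on the storage example. The node names, the failure probabilities, and the exhaustive-enumeration method are all illustrative assumptions rather than anything specified here or drawn from fault tree tooling; enumeration over every leaf state is only feasible for very small graphs.

```python
from itertools import product

# Hypothetical AND/OR graph loosely following the storage example:
# the application needs storage A OR storage B; each storage service
# needs its own disks AND the shared network provider C.
GATES = {
    "app": ("OR", ["storage_a", "storage_b"]),
    "storage_a": ("AND", ["disks_a", "network_c"]),
    "storage_b": ("AND", ["disks_b", "network_c"]),
}
# Illustrative (made-up) probabilities that each leaf is down.
LEAF_FAILURE = {"disks_a": 0.01, "disks_b": 0.01, "network_c": 0.001}


def works(node, leaf_up):
    """Evaluate whether a node operates, given which leaves are up."""
    if node in leaf_up:
        return leaf_up[node]
    kind, children = GATES[node]
    results = [works(child, leaf_up) for child in children]
    return all(results) if kind == "AND" else any(results)


def exact_availability(root):
    """Sum the probability of every leaf up/down combination in which root works."""
    leaves = sorted(LEAF_FAILURE)
    total = 0.0
    for states in product([True, False], repeat=len(leaves)):
        leaf_up = dict(zip(leaves, states))
        prob = 1.0
        for leaf, is_up in leaf_up.items():
            prob *= (1.0 - LEAF_FAILURE[leaf]) if is_up else LEAF_FAILURE[leaf]
        if works(root, leaf_up):
            total += prob
    return total


def naive_availability(node):
    """Combine children as if independent, ignoring shared leaves."""
    if node in LEAF_FAILURE:
        return 1.0 - LEAF_FAILURE[node]
    kind, children = GATES[node]
    child_avail = [naive_availability(child) for child in children]
    if kind == "AND":
        result = 1.0
        for a in child_avail:
            result *= a
        return result
    all_fail = 1.0
    for a in child_avail:
        all_fail *= 1.0 - a
    return 1.0 - all_fail


exact = exact_availability("app")
naive = naive_availability("app")
print(f"naive unavailability: {1 - naive:.6f}")
print(f"exact unavailability: {1 - exact:.6f}")
```

With these assumed numbers, treating A and B as independent underestimates the chance of losing both replicas by roughly a factor of nine, almost entirely because of the shared dependency on C.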

Cloud providers may be reluctant to release detailed 
dependency information publicly for business reasons, 
but might willingly release it to a trusted third party, such
as an organization analogous to Underwriters Labora- 
tories (UL) offering cloud reliability analysis services. 
More ambitiously, cloud providers might leverage TPM-
attested, IFC-enforcing kernels [22] to exchange and an-
alyze dependency graph information, without allowing
proprietary information to "leak" beyond this analysis. 

Preserving Cloud Artifacts: Enabling the long-term 
preservation of cloud artifacts will require solving both 
incentive problems and technical challenges. 

In a cloud-based computing model, application and 
service providers currently need not, and have little in- 
centive to, make publicly available all the software and 
data underlying the service that would be necessary for 
accurate historical preservation. Competition encour- 
ages providers to closely guard the "secret sauce" un- 
derlying their products. This incentive has long led tra- 
ditional software vendors to release their software only 
in binary form — often with deliberate obfuscation to 
thwart analysis — but only the cloud model frees the ven- 
dor entirely from the need to release their code in any 
form directly executable by the customer. Solving this 
incentive problem will likely require social, commercial, 
and/or governmental incentives for providers to make 
their cloud-based artifacts preservable in some way. 

On the technical side, cloud-based services often rely 
on enormous, frequently-changing datasets, such as the 
massive distributed databases underlying Google Search 
or Maps or an MMOG's virtual world. Even if will- 
ing, it might be impractically costly for providers to 
ship regular snapshots of their entire datasets to digi-
tal archivists, even well-provisioned ones such as the
Library of Congress, not to mention costly for receiv-
ing archivists to do anything with such enormous snap-
shots beyond saving the raw bits. A more practical ap-
proach may be for providers themselves to be respon-
sible for saving historical snapshots in the short term,
using standard copy-on-write cloning and deduplicated
storage technologies for efficiency [12, 16]. After some
time period, say 5-10 years, a select subset of these his- 
torical snapshots might then be transferred to external 
archives for long-term preservation, at considerably re- 
duced cost-per-bit in terms of both network bandwidth 
and storage due to intervening technological evolution. 
Any solution would need to address many other chal- 
lenges, such as ensuring the durability and integrity 
of online digital archives [9] and the honesty of their
providers [17], maintaining information security of sen- 
sitive data in snapshots of cloud-based artifacts, and pre- 
serving artifacts' practical usability in addition to their 
raw bits, but we leave these issues to future work. 
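
To suggest why deduplicated storage could keep provider-side historical snapshots affordable, the following sketch stores snapshots as lists of content hashes so that unchanged chunks are kept only once. It is a minimal illustration under stated assumptions: the class name SnapshotStore, the fixed-size chunking, and the use of SHA-256 are hypothetical choices, not the design of any cited system.

```python
import hashlib


class SnapshotStore:
    """Toy content-addressed store: identical chunks are stored only once."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size  # tiny on purpose, for demonstration
        self.chunks = {}              # digest -> chunk bytes
        self.snapshots = {}           # label -> ordered list of digests

    def save(self, label, data):
        """Record a snapshot, adding only chunks not already stored."""
        digests = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.snapshots[label] = digests

    def restore(self, label):
        """Rebuild the original bytes of a snapshot from its chunks."""
        return b"".join(self.chunks[d] for d in self.snapshots[label])

    def stored_bytes(self):
        """Total bytes physically kept across all snapshots."""
        return sum(len(chunk) for chunk in self.chunks.values())


store = SnapshotStore()
store.save("year-1", b"aaaabbbbcccc")
store.save("year-2", b"aaaabbbbdddd")  # only the last chunk changed
print(store.stored_bytes(), "bytes stored for 24 bytes of snapshots")
```

Here the second snapshot adds only the one chunk that differs from the first, so the cost of retaining history grows with the amount of change rather than with the full dataset size.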

6 Conclusion 

While the cloud computing model is promising and at- 
tractive in many ways, the author hopes that this paper 
has made the case that the model may bring risks beyond 
obvious information security concerns. At the very least, 
it would be prudent for us to study some of these risks 
before our socioeconomic system becomes completely 
and irreversibly dependent on a computing model whose 
foundations may still be incompletely understood. 